Analysis of missing data patterns (in daily data set)

Vincent Bagilet (Columbia University, https://www.sipa.columbia.edu/experience-sipa/sipa-profiles/vincent-bagilet), Léo Zabrocki (Paris School of Economics, https://www.parisschoolofeconomics.eu/en/)
2020-11-06

In this document, we carry out a first analysis of missing data patterns in the daily data set. The graphs and figures displayed here are purely informative and are not intended for publication in their current state. Most graphs are produced using loops and therefore share standard, uniform settings. Some fine-tuning would be needed before publishing specific graphs.

First of all, we noticed some issues with the data set and therefore, for now, we restrict it to observations not subject to these issues. In particular, we do not have pollution data for NO in 2014, 2015, 2019 and 2020, and we have very few observations for PMs in 2019. These issues do not come from actual missing data but from problems with the package we used to access the pollution data. These data are available through the EEA interface.

Proportion of missing observations in covariates

We first analyse the proportion of missing observations for each covariate.

Variable                     Proportion of missing values
City                         0.0000000
Concentration                0.0194582
Elevation weather station    0.0000093
Global radiation             0.2730897
Holiday zone                 0.0000000
Insolation duration          0.2406190
Latitude weather station     0.0000093
Longitude weather station    0.0000093
Public holiday               0.0000093
Rainfall duration            0.0718760
Rainfall height              0.0009501
Relative humidity            0.0004471
School holiday               0.0000093
Sea level pressure           0.0000838
Temperature                  0.0000838
UV radiation                 0.8994490
Wind direction               0.0005775
Wind speed                   0.0005542
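
As an illustration of how such a table can be computed, here is a minimal Python sketch using only the standard library. It is not the code used to produce this document: the toy records and the variable names (`temperature`, `wind_speed`, `uv_radiation`) are hypothetical, and missing values are assumed to be encoded as `None`.

```python
# Hypothetical toy records; None marks a missing value (an assumption).
rows = [
    {"temperature": 12.1, "wind_speed": 3.0, "uv_radiation": None},
    {"temperature": None, "wind_speed": 2.5, "uv_radiation": None},
    {"temperature": 9.8, "wind_speed": 4.1, "uv_radiation": 1.2},
    {"temperature": 11.0, "wind_speed": None, "uv_radiation": None},
]


def missing_share(rows):
    """Return {variable: share of rows in which the variable is None}."""
    variables = sorted({name for row in rows for name in row})
    n = len(rows)
    return {
        name: sum(row.get(name) is None for row in rows) / n
        for name in variables
    }


shares = missing_share(rows)
for name, share in shares.items():
    print(f"{name:<15}{share:.4f}")
```

The same function can be applied to the rows of a single year to obtain a breakdown by year.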

To better understand these results, we break down the analysis by year and display the results in a graph for readability.

In a complete case analysis, observations for which any variable is missing are dropped. One may therefore wonder what share of the observations such an analysis would drop. We only consider variables that are potentially relevant. We also exclude the UV radiation variable because of its large share of missing values.

Carrying out a complete case analysis would lead us to drop 36.7068834% of the observations. However, if insolation duration, global radiation and rainfall duration are also excluded, concentration data becomes the limiting factor.
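
The effect of the variable selection on the share of dropped observations can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the computation behind the figure above: the records and variable names are hypothetical and missing values are assumed to be encoded as `None`.

```python
# Hypothetical toy records; None marks a missing value (an assumption).
rows = [
    {"concentration": 20.0, "temperature": 12.0, "insolation_duration": None},
    {"concentration": None, "temperature": 10.0, "insolation_duration": 3.0},
    {"concentration": 18.0, "temperature": 11.0, "insolation_duration": None},
    {"concentration": 25.0, "temperature": 9.0, "insolation_duration": 5.0},
]


def share_dropped(rows, variables):
    """Share of rows with at least one missing value among `variables`."""
    dropped = sum(any(row[v] is None for v in variables) for row in rows)
    return dropped / len(rows)


all_variables = ["concentration", "temperature", "insolation_duration"]
reduced = ["concentration", "temperature"]

share_all = share_dropped(rows, all_variables)
share_reduced = share_dropped(rows, reduced)
print(share_all, share_reduced)
```

In this toy example, excluding the sparsely observed variable leaves concentration as the only source of dropped rows.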

Location of measurement stations

The stations considered are located in the 17 largest cities in France:

City Number of stations
Bordeaux 4
Clermont-Ferrand 13
Dijon 7
Grenoble 5
Le Havre 17
Lille 7
Lyon 22
Marseille 11
Montpellier 7
Nancy 4
Nantes 16
Nice 9
Paris 29
Rennes 9
Rouen 7
Strasbourg 9
Toulouse 19

Missing data patterns in air pollution data

Here, we investigate whether missing pollutant concentration data varies across different dimensions.

The overall share of missing air pollution observations is 0.0324046.

Evolution of the share of missing values with the values of covariates

In this section, we investigate whether the share of missing values varies with the values of covariates. One may expect that, for extreme values of some covariates such as temperature, wind speed or precipitation level, measurement instruments are more likely to be defective, leading to more missing values.
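
One simple way to look at this is to bin a continuous covariate and compute the share of missing concentration values within each bin. The Python sketch below is only an illustration: the data, the variable names and the bin width of 10 are hypothetical choices, and missing values are assumed to be encoded as `None`.

```python
from collections import defaultdict
from math import floor

# Hypothetical toy records; None marks a missing concentration (an assumption).
temperatures = [-3, -1, 2, 7, 8, 12, 13, 14, 19]
concentrations = [None, 21.0, 18.0, 25.0, None, 30.0, 28.0, 19.0, 22.0]
rows = [
    {"temperature": t, "concentration": c}
    for t, c in zip(temperatures, concentrations)
]


def missing_share_by_bin(rows, covariate, width):
    """Share of missing concentrations per covariate bin of size `width`.

    Bins are identified by their lower bound; rows where the covariate
    itself is missing are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # lower bound -> [missing, total]
    for row in rows:
        x = row[covariate]
        if x is None:
            continue
        lower = floor(x / width) * width
        counts[lower][0] += row["concentration"] is None
        counts[lower][1] += 1
    return {lower: m / t for lower, (m, t) in sorted(counts.items())}


by_bin = missing_share_by_bin(rows, "temperature", width=10)
print(by_bin)
```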

Across pollutants

The share of missing values varies markedly across pollutants: in the table below, it ranges from about 0.4% for NO to about 7% for SO2. This suggests that missingness patterns may need to be analyzed separately for each pollutant.

Pollutant    Proportion of missing values
no           0.0037740
no2          0.0048729
o3           0.0172493
pm10         0.0610813
pm2.5        0.0522411
so2          0.0702874
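
A breakdown of this kind amounts to a grouped share of missing values. The following Python sketch shows the idea on hypothetical toy data (the values are invented and missing concentrations are assumed to be encoded as `None`); it is not the code used for the table above.

```python
from collections import defaultdict

# Hypothetical (pollutant, concentration) pairs; None marks a missing value.
observations = [
    ("no2", 10.0), ("no2", None), ("no2", 12.0), ("no2", 11.0),
    ("pm10", None), ("pm10", None), ("pm10", 30.0), ("pm10", 28.0),
]


def missing_share_by_group(observations):
    """Return {group: share of None values} for (group, value) pairs."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for group, value in observations:
        total[group] += 1
        missing[group] += value is None
    return {group: missing[group] / total[group] for group in sorted(total)}


share_by_pollutant = missing_share_by_group(observations)
print(share_by_pollutant)
```

The grouping key could equally be a city, a year or a month.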

Across locations

We examine whether missingness patterns vary across location characteristics.

Across dates and time

We then investigate whether the share of missing values evolves with dates and time.

We also explore these patterns more closely for some variables by decomposing them by year, month or pollutant.

Across weather variables

Only consider first missing value

We plot the same graphs as before but only consider the hours when the data started missing, ignoring the subsequent consecutive missing observations.

Compared to the full sample, the share of missing data decreases since we discard many observations with missing values (every observation that is not the first of its period of missing data). Hence, the share of missing data is not informative in itself; only potential differences in this share across “grouping variables” are.
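
The restriction to the first missing observation of each period can be sketched as below. This Python example is an illustration only: it assumes an ordered series for a single station and pollutant, with missing values encoded as `None`, and it assumes that later observations of a missing period are removed from the sample altogether, which is one reading of the description above.

```python
def keep_first_missing(values):
    """Drop every missing value that directly follows another missing value."""
    kept = []
    previous_missing = False
    for value in values:
        missing = value is None
        if not (missing and previous_missing):
            kept.append(value)
        previous_missing = missing
    return kept


def missing_share(values):
    """Share of None values in a sequence."""
    return sum(v is None for v in values) / len(values)


# Hypothetical series with one run of three missing values and one single gap.
series = [5.0, None, None, None, 7.0, None, 8.0, 9.0]
first_only = keep_first_missing(series)

print(missing_share(series), missing_share(first_only))
```

In this toy series the share falls from one half to one third, which illustrates why only differences across groups, not the level itself, are informative.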

Balance between missing and non-missing observations

Balance graphs

We first investigate whether covariates are balanced between observations for which concentration data is missing and those for which it is not.
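
A common way to summarise balance numerically is the standardized mean difference of a covariate between the two groups. The source does not state which balance measure is plotted, so the Python sketch below is only one possible choice, shown on hypothetical data with missing values encoded as `None`.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical toy records; None marks a missing concentration (an assumption).
rows = [
    {"concentration": None, "temperature": 2.0},
    {"concentration": None, "temperature": 4.0},
    {"concentration": 20.0, "temperature": 10.0},
    {"concentration": 22.0, "temperature": 12.0},
    {"concentration": 18.0, "temperature": 14.0},
]


def standardized_mean_difference(rows, covariate, outcome="concentration"):
    """Difference in covariate means (missing minus observed outcome),
    divided by the square root of the average of the two group variances."""
    missing = [r[covariate] for r in rows
               if r[outcome] is None and r[covariate] is not None]
    observed = [r[covariate] for r in rows
                if r[outcome] is not None and r[covariate] is not None]
    pooled_sd = sqrt((variance(missing) + variance(observed)) / 2)
    return (mean(missing) - mean(observed)) / pooled_sd


smd = standardized_mean_difference(rows, "temperature")
print(round(smd, 3))
```

Values close to zero indicate that the covariate is similarly distributed, on average, in both groups.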

We can refine this analysis by looking separately across cities.

Distribution of covariates

It might also be interesting to see whether covariates have a similar distribution for observations where data is missing and for those where it is not.

One may also be interested in looking at these distributions by pollutant. The results are rather similar across all pollutants. We do not display them to avoid overcrowding the document even further.

Last value before missing

If data is missing due to external factors, what matters might be the value of these external factors when the data started missing, i.e., potentially when the sensor first became defective. As a consequence, we look into the distribution of the covariates for the last value before a missing concentration observation.
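
Extracting the covariate value at the last observed time step before each missing period can be sketched as follows. This Python example assumes two aligned, time-ordered series for a single station, with missing concentrations encoded as `None`; the data are hypothetical.

```python
def last_before_missing(concentration, covariate):
    """Covariate values at the last observed step before each missing run."""
    values = []
    for i in range(1, len(concentration)):
        starts_missing = (
            concentration[i] is None and concentration[i - 1] is not None
        )
        if starts_missing:
            values.append(covariate[i - 1])
    return values


# Hypothetical aligned series (same station, consecutive time steps).
concentration = [20.0, None, None, 18.0, 22.0, None, 19.0]
temperature = [1.0, 0.5, 0.2, 3.0, -2.0, -2.5, 4.0]

before = last_before_missing(concentration, temperature)
print(before)
```

A missing run that starts at the very first time step has no preceding observation and is therefore ignored by this sketch.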

For concentration, we carry the last value forward in order to see whether missing concentration data is associated with different concentration values just before the data goes missing, as compared to when concentration data is not missing. We also filter out high concentration values in order to see the distribution more clearly.
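
Carrying the last value forward can be sketched in a few lines of Python. The example assumes a time-ordered series for a single station and pollutant, with missing values encoded as `None`; the data are hypothetical.

```python
def carry_last_value_forward(values):
    """Replace each None by the most recent non-missing value, if any."""
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical concentration series.
series = [None, 20.0, None, None, 18.0, None]
filled = carry_last_value_forward(series)
print(filled)
```

Leading missing values stay missing because there is no earlier observation to carry forward.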

Length of periods with missing observations

In this section, we explore the length of periods with missing observations. This length may provide information on the causes of missingness. Observations missing for long periods of time may be indicative of clogged filters or broken instruments. We also explore whether the length of missing periods is correlated with weather variables.

First, we explore the length of periods with missing observations by displaying, in a heatmap, whether concentration data is missing for each city-date pair. We break this down by year for readability.

Then, we look at the length of periods with missing data. We can either count the number of periods with a given length (e.g., 3 periods have a length of missing data of 5 hours/days) or count the number of dates belonging to periods with a given length (in the same example, 15 dates belong to a period of missing data of length 5 hours/days). We denote the former case “One observation per period” and the latter “One observation per date”.
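
The two counting conventions can be illustrated with a short Python sketch. It assumes a time-ordered series with missing values encoded as `None`; the series is hypothetical and built to reproduce the example of 3 periods of length 5.

```python
from collections import Counter
from itertools import groupby


def missing_run_lengths(values):
    """Lengths of consecutive runs of None, in order of appearance."""
    return [
        sum(1 for _ in run)
        for is_missing, run in groupby(values, key=lambda v: v is None)
        if is_missing
    ]


# Hypothetical series: three missing runs of length 5 and one of length 2.
series = (
    [1.0] + [None] * 5 + [1.0] + [None] * 5
    + [1.0, 1.0] + [None] * 5 + [1.0] + [None] * 2
)

lengths = missing_run_lengths(series)

# "One observation per period": number of periods with each length.
per_period = Counter(lengths)

# "One observation per date": number of dates in periods of each length.
per_date = {length: length * count for length, count in per_period.items()}

print(per_period[5], per_date[5])
```

With real data, the series would first be split by station and pollutant so that runs do not span two different series.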

We might be interested in looking at the length of missing periods for different pollutants. The method used to measure concentration varies across pollutants, and the reasons for missing data may depend on the method. Particulate matter is measured with filters, which can become clogged. This could lead to rather long missing periods, given the time needed to clean the filter. Gaseous pollutants are measured using optical methods and are thus not subject to clogged filters.

Distribution of lengths of missing data

As previously, we look at the distributions considering one observation per missing period and one observation per date. The latter case naturally changes the distribution greatly; for instance, one series of missing data of 100 hours/days is only counted once in the former case but 100 times in the latter.

Correlation between missingness length and weather variables

In this section, we investigate whether the length of periods of missing data varies with weather variables. Due to the large number of observations considered here, we look at bivariate distribution plots instead of scatter plots.

Then, we look into the weather “values” at the time the data started missing. If missingness is caused by some weather feature, the weather at the time of the first missing observation is the one to look into.
